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Abstract 

This study focuses on an alignment-free sequence comparison meth- 
od: the number of words of length k shared between two sequences, 
also known as the D2 statistic. The advantages of the use of this 
statistic over alignment-based methods are firstly that it does not as- 
sume that homologous segments are contiguous, and secondly that 
the algorithm is computationally extremely fast, the runtime being 
proportional to the size of the sequence under scrutiny. Existing ap- 
plications of the D2 statistic include the clustering of related sequences 
in large EST databases such as the STACK database. Such applica- 
tions have typically relied on heuristics without any statistical basis. 
Rigorous statistical characterisations of the distribution of D2 have 
subsequently been undertaken, but have focussed on the distribution's 
asymptotic behaviour, leaving the distribution of D2 uncharacterised 
for most practical cases. The work presented here bridges these two 
worlds to give usable approximations of the distribution of D2 for 
ranges of parameters most frequently encountered in the study of bi- 
ological sequences. 

1 Introduction 

The accelerating rate of accumulation of molecular sequences in public databases 
has triggered the development of a number of sequence comparison algorithms. 
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The most popular algorithms, such as FASTA, BLAST or BLAT, rely on sequence 
alignment, and assume contiguity between homologous segments. This assumption 
is, however, often broken in molecular sequences, due to events such as transposi- 
tion, unequal crossing over or alternative splicing. To address this issue, a number 
of alignment-free sequence comparison methods have been developed. Amongst 
them, the count of words of length k letters matching between two sequences, 
also known as the D2 statistic, has found some successful applications, due to its 
simplicity and its speed. The algorithm to calculate the D2 statistic between two 
sequences runs as a linear function of the sequences' lengths, whereas alignment- 
based sequence comparison methods typically have a worst case runtime quadratic 
in the sequences' lengths. The first applications of the D2 statistic relied on heuris- 
tics to decide whether sequences are significantly similar, but did not have any 
statistical basis. 

A rigorous examination of the distribution of D2 led to the characterisation 
of asymptotic distributions, but the behaviour of D2 in practical cases remains 
unknown. In a previous study, we characterized D2 optimal word sizes for a range 
of sequence sizes. The goal of the present study is to find approximations of the 
distribution of D2 for word sizes close to optimal, and for the sequence sizes most 
frequently encountered in molecular databases. 



2 Background 

The D2 statistic is defined to be the number of exact word matches of length 
k between sequences A = (^1, . . . , A^n) and B = {Bi, . . . , S„), with Ai and Bj 
belonging to a given alphabet A. For mathematical convenience we will impose 
periodic boundary conditions on both sequences, that is, the letter in the first 
position in sequence A is assumed to follow the letter in the mth position, and the 
letter in the first position in sequence B is assumed to follow the letter in the nth 
position. For k « m,n we do not expect our results to differ significantly from 
the usual case of free boundary conditions. 



Defining the indicator variables Y(^ij^ for a word match between the /c-word at 
position z in A and the word at position j in B by 

Y = J ^ ■ ■ ■ ' "^(i+k-l) mod m) = (-^j) • • • > ^{j+k-l) modn) q\ 

('■^'^ \ Otherwise, ^ ' 

the D2 statistic is given by 

(iJ)G/ 

where / = {(i, < i < m,l < j < n}. For the case of free boundary conditions 
the index set is replaced by I = {(i, j)|l < i < m — -|- 1, 1 < j < ra — A; + 1}. 
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We are interested in the distributional properties of D2 under the nuh hypothesis 
that A and B are Bernouhi texts, meaning that each letter, Ai or Bj, is indepen- 
dently and identically (i.i.d.) distributed. Let the probability of occurrence of 
letter a & A he fa, and define 

Pt = Y.f-' t = l,2,.... (3) 
The mean of D2 is then |7]: 



EiD2) = = mn 




An exact value for the variance of D2 has recently been given for the case of free 
boundary conditions in [6]. In Appendix I we derive a similar formula for the 
variance for the algebraically simpler case of periodic boundary conditions. 

From here on, to simplify matters we set m = n. Rigorous results exist for the 
limiting distribution of D2 as n ^ 00 in certain regimes. For pairs of Bernoulli 
texts with non-uniform letter distributions, the limiting distribution is compound 
Poisson in the regime k > 21ogf,n -|- const. [7], and normal in the regime k < 
1/2 logf, n -|- const [2j. B.er:eb = p2~^- 

In earlier numerical analyses [1], we tested the accuracy with which /c-word 
matches are able to measure the relatedness of artificially evolved sequences. Cal- 
culations of the optimum word size k for a range of sequence lengths n, showed that 
optimum word sizes generally fall between the two parameter regimes for which 
the asymptotic behaviour of D2 is known. Our purpose here is to perform nu- 
merical experiments to fill in the gap in the biologically relevant parameter regime 
between the asymptotically normal and compound Poisson asymptotic behaviours, 
and to find accurate and practical approximations to the distribution of D2 in this 
parameter region. In particular, we are concerned with accurately reproducing 
the region of the tail corresponding to classical significance levels (0.001%, 0.01%, 
. . . ) , both for the distribution of D2 , and for its extreme value distribution that is 
used for determining p-values in database searches. 

3 Simulations of the empirical distribution 

The distribution of D2 was simulated for a number of combinations of sequence 
size n, word size k, alphabets A and sequence composition fa- For nucleic acid 
sequences, word sizes close to the optimal word size were chosen, based on com- 
putation of the optimal word size of D2 [4]. We focused on sequence sizes typ- 
ical of ESTs, whole genome shotgun sequencing trace pairs, CDSs, and mRNAs 
(100 < n < 3200 bases). For protein amino acid sequences, the optimal word sizes 
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and a letter composition equal to the average of the proteins encoded by the hu- 
man genome where determined using the same method. For protein sequences of 
length up to n = 400 the optimum word size was k = 3, and for longer sequences 
up to n = 3200 the optimum word size was k = A. The sequence sizes for proteins 
ranged from small peptides to large proteins (10 to 2560 residues). Sequences were 
simulated with uniform and non-uniform letter distributions. 

For each combination of parameters, A'sampie = 10^ pairs of Bernoulli text se- 
quences were generated. The extreme value distribution was simulated by taking 
the largest value of 100 comparisons A^sampie times. The code for the simulations 
was written in ANSI C and is available from the author's website ^ . 

4 Comparison between empirical and hypoth- 
esised distributions 

Previous studies of the D2 statistic used Kolmogorov-Smirnov tests [H] to com- 
pare the empirical distribution of D2 with its theoretical asymptotic distributions 
(normal or compound- Poisson) [TJH]. These studies, however, have been in error 
for the following reason. Care must be taken when using the Kolmogorov-Smirnov 
test to pre-specify the parameters of the distribution being compared. If instead, 
parameters are estimated from the empirical distribution, the p-values obtained 
will be overestimated (see Appendix II). Given that these earlier studies generally 
pre-dated the discovery of an analytic formula for the variance of D2-, they relied 
on means and variances estimated from empirical samples, and therefore led to 
overly optimistic claims of agreement between the distribution of the D2 statistic 
for finite length sequences and its theoretical asymptotic limit. 

We have repeated Kolmogorov-Smirnov tests of our empirically generated data, 
standardised with the analytically determined mean and variance of 02, against 
the standard normal distribution. In general, we find p-values to be smaller than 
those reported in earlier studies. Similar results were obtained using the Shapiro- 
Wilk test, which tests for normality but does not require prior knowledge of the 
mean or variance. More importantly, we find that the information provided by 
such comparison is rather limited, as the p-value of the Kolmogorov-Smirnov test 
decreases noticeably with the sample size A'^sampiej since the true distribution of 
D2 for finite sequence length n never exactly matches the hypothesised limiting 
distribution. We conclude that this type of measure does not give a panacea for 
how well (or how badly) a given hypothesised distribution will approximate the 
distribution of D2. 

Most practical uses of the D2 statistic involve the calculation of a p-value re- 
sulting from the comparison of two sequences or from the comparison of a query 
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sequence to a sequence database. Our approach therefore is to compare a hypoth- 
esised distribution with an empiricahy generated distribution of D2 based on a 
direct comparison of the p-values obtained with these two distributions. If the 
p-values of a given hypothesised distribution agree well with those of the empiri- 
cal distribution, this hypothesised distribution could be used to approximate the 
relevant tail of the real distribution of D2 ■ 



Suppose we wish to compare a postulated distribution function Fhyp with an 
empirically generated sample {xi, . . . ,X]\f^^^^^^}. To evaluate how accurately p- 
values predicted by F^yp would approximate those of the true distribution of D2, 
the quantiles 

^hyp = -?^hyp(l-^'hyp) (5) 

are first calculated for to a number of p-values, Phyp- The frequency, in the simu- 
lated data, of the occurrences of D2 greater than ghyp then provides an empirical 
p- value, 

_ \{xi : Xj > ghypll . . 

Pemp — ,x • vOj 

'^sample 



This is compared to Phyp- 

S = log 



( Pemp \ 
Phyp J 



(7) 



The comparisons focussed on p-values in the range of classical significance levels 
(Phyp £ {0.001%, 0.01%, .. .}). The theoretical distributions were parameterized 
using the exact values of 1)2 's mean and variance. Zero values of Pemp were replaced 
by l/A'sampie- The hypothesised distributions considered were the normal and 
gamma distributions. The process is illustrated in Fig. [H 



When doing database searches, a query sequence is compared to several se- 
quences, and the p-value of the best score of all these comparisons needs to be 
estimated. The relevant statistic in this case is the extreme value, that is, the 
maximum of a number of i.i.d. random variables. In addition to evaluating the tail 
of the distribution of D2 itself, the tail of the empirical extreme value distribution 
of D2 was also compared to those of the the normal and the gamma distributions. 
These two extreme value distributions belong to the Gumbel family and can be 
easily computed (see Appendix III). 



5 Results 

5.1 Approximating the distribution of D2 

We first assessed the approximation of the distribution of D2 with the normal 
distribution. Figure [2] shows the results of the comparison of the p-values in the 
case of nucleic sequences with a uniform letter distribution. Similar results were 
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Figure 1: Distribution of D2 for n = 800, k = 7. The histogram shows the 
empirical distribution, and the continuous curve is a hypothesised gamma 
distribution with mean given by Eq. H] and variance from the calculation 
in Appendix I. (a) Global view of the distribution, (b) Detail of the right 
hand tail of the distribution, the vertical bar shows the quantile qhyp, the 
area under the curve is the corresponding level Phyp, the hatched area is the 
empirical level Pemp- 

obtained with non-uniform letter distributions. For sequences 1600 base pairs long 
or larger, the p-values from the hypothesised normal distribution were very close 
to the empirical p-values. For smaller sequences and large p-values (up to 1%), the 
normal and empirical p-values were of the same order of magnitude. For smaller p- 
values, the hypothesised normal distribution greatly overestimated the significance 
of D2. A few other distributions were compared to the simulated distribution of 
D2- The gamma distribution, in particular, approximated the distribution of D2 
better than the normal distribution did. In this case, the real p-values tended to 
be overestimated, and the relative difference increased as the p-values decreased 
(figure E]). 

The trends were identical for the amino acid alphabet (figure S]) : the normal 
distribution approximates the p-values relatively well for large sequences and mod- 
erate significance levels, but for shorter sequences and further into the tail of the 
distribution, p-values were strongly overestimated. The gamma distribution gen- 
erally underestimated the p-values, but was closer to the simulated distribution of 
D2 (figure ED. 

5.2 Extreme value distribution 

Figure [6] shows the results of comparison between the extreme value distribution 
of D2 , and the extreme values of the gamma and normal distributions in the case 
of a uniform nucleotide letter distribution for phyp in the range 0.1% to 10%. The 
extreme value distribution of D2 is generally better approximated by the maximum 
of gamma distributions. Since it was noted in the previous section that the relative 
difference between the distribution of D2 and the normal or the gamma distribution 
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Figure 2: Normal distribution versus empirical distribution of D2, DNA al- 
phabet with uniform letter distribution. Each table compares the two dis- 
tributions at a given level of the hypothesised distribution, for a number of 
combinations of sequence lengths n and word sizes k. The value in each cell 
corresponds to the empirical level. The colour of each cell reflect the value 
of S, as introduced in Eq. [71 
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Figure 3: Gamma distribution versus empirical distribution of D2, 
alphabet with uniform letter distribution. See legend of figure [21 
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Figure 4: Normal distribution versus empirical distribution of D2, amino acid 
alphabet. See legend of figure [21 
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Figure 5: Gamma distribution versus empirical distribution of D2, amino 
acid alphabet. See legend of figure [21 



10 



increased as the p-values decreased, it is not surprising that the approximations 
of the extreme value distribution of D2 are not as good as the approximations to 
the distribution of L>2- The same trends were observed for nucleic sequences of 
non-uniform letters and amino acid sequences. 

6 Discussion and Conclusions 

This study introduces practical approximations to the distribution of the D2 
statistic and to the extreme value distribution of D2- For sequences of intermedi- 
ate length (around 800 base pairs, close to the average size of ESTs and sequencing 
traces) and for p-values between 5% and 0.1%, the Gamma distribution closely ap- 
proximates the distribution of D2- The Gamma distribution not only outperforms 
the normal distribution, but unlike the latter, it slightly overestimates the p-values, 
and thus would result in fewer false positives. 

All the approximations presented here deteriorate as one moves further to the 
right hand of the tail (for smaller p-values). This is not, however, a major problem 
for any practical use of these approximations, where very small p-values would just 
have an indicative value. 

Finally, our results show, that for longer sequences, such as genome assembly 
contigs, the normal approximation itself would be appropriate, even for very small 
p-values. 

Appendices 

I. Calculation of Var D2 

Using Eq. [21 the variance of D2 is 

Var(Z)2)=Var I ^ %) | = J] Var(%))+ ^ Gov (%), y(,,,,)). 
\(ij)6/ / (i,i)e/ {himi'd') 

(8) 

To simplfy the notation from here on we set u = v = The first term 

in Eq. [8] depends only on 

Var (y„) = i?(y„2) _ (i^(y„))2 = ^(y^) _ (^(yj)2 ^ _ ^^2fc^ 
where pt is defined in Eq. [3l Thus 

Var (y„) = mn (pa' - • (10) 
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Figure 6: Extreme value of Normal and Gamma versus empirical extreme 
value of D2, DNA alphabet with uniform letter distribution. See legend of 
figure [21 
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To calculate the covariances in the second term of Eq. [8l it is convenient to use 
the notation and terminology of [8], Chapter 11. Let Ju = {v = {i' , j') ■ \i' — i\ < 
k or |j' — j\ < k} be the dependency neighbourhood of Yu- It can be decomposed 
into two parts, accordion and crabgrass, Ju = Ju-> where 

Ju = {v = ^Ju ■■ \i' -i\<k and \j' - j\ < k}} and 3^ = J^\ J^. 

We compute the cross covariances, Gov Yy), by looking at the following cases. 
Case 1: v ^ J^. In this case, Y^ and Y^ are independent and hence Gov (1^, 1^) = 
0. 

Case 2: ?; E J^. Let u = and v G J^. Consider first the subcase v = {i+t,j'), 
where |j — j'\ >k and < t < A; — 1. Then 

E{Y^Y,) = Pr(y„ = l,n = l) 

~ ^ ^ (/ai • • • /afe+t)(/ai • • • fak)(.fai+t • ■ ■ fcik+t) 

(ai,...,afe+i)e^ 

(\ 2t / \ k—t 



(11) 



where we have used the fact that word matches occur simultaneously at u and v if 
and only if the first k letters of {Ai, . . . , Ai+k+t-i) are repeated at {Bj, . . . , Bj^k-i) 
and the final k letters are repeated at (-Bj', . . . ,-Bj'+fc_i). This gives 

2k 



Gov = E{YuY,) - E{Yu)E{Y,) = ps'^Ps'^"* 

Extending the argument toall— A: + l<t</i; — 1 gives 

Cov(y„,n)=P2^iV-i*i-P2^^ 



P2 



(12) 



(13) 



By symmetry of the covariance function, the same result applies to the sub-case 
V = {i' ,j + t) where \i — i'\ > k and \t\ < k — 1. 

The crabgrass contribution to the sum over covariance terms in Eq. [H] is then 

^^Gov(y„,y,) 



k-l 



El E + E 

Ji':|i'-il>fc} {i':\i'-i\>k}/ t=~k+l 

fc-1 

mn{m + n — 4A; + 2 



P2 



2k\ 



P,'+2Y,P2'W~' 



{2k - l)p2 



2k 



t=l 



mn{m + n — 4/c + 2) 



k , r, 2 P3 
P3 + 2P2 P3 — 



k-l 



P3 - P2' 



n 2(fc-l) 

--{2k-l)p,^^ 



(14) 
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Figure 7: The main diagonal of J" referred to in Case 3 (black squares), and 
the sub-regions I to VI referred to in Cases 4 and 5. 



Case 3: v is on the main diagonal of J^. That is, v = (i + t, j + t), where 
—k<t<k and t (see Fig. [7|). In this case, 

E{YM = Pr(y„ = i,y, = 1) 

= Pr(a specific {k + |t|)-word match at the (i, j) position) 



/ai X ... X fak+\t\ 



(ai,...,afe+|,|)ey4'=+l*l 



(15) 



and 



Gov (y„, Y,) = E{YuY,) - E{Yu)E{Y,) = ^2^^+'*' - ^2'^ (16) 
The contribution to the sum over covariance terms in Eq. [8] from Case 3 is then 

Gov (y,,y,) 

u (jgmain diagonal, v^i^it 
k-l 



t=i 



2mn 



P2 



k+l 



1 -P2 



1 -P2 



(17) 



Case 4: V one of the subregions I, II, III or IV of in Fig. That is, 
V = {i + t, j + s), where 
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I: 0<s<t<A;-l; 

II: -A; + 1 < s < t < 0; 

III: -A;+l<t<s<Oor 

IV: < t < s < A; - 1. 

Consider subregion I first. The word matches corresponding to the event "1^ = 
1,Y^ = 1" are illustrated in Fig. [8l For such a situation to occur, the t + s 
letters oi, . . . , Og, bi, . . . ,bs and ci, . . . , ct-s can be specified independently, and 
the remaining 2k letters within the four words must be repeats of ci, . . . ,ct—s as 
shown. The sequence ci, . . . is repeated v = [{k — s)/(t — s)J complete times 
in sequence B and v + 1 complete times in sequence A, where [ J indicates the 
integer part. At the right hand end of these repeats, the sequence ci, . . . , Cp occurs 
once in Sequence A and once in Sequence B, where p = {k — s) mod (t — s). 

Then 

E{Y^Y,)=Vi{Y^ = l,Y, = l) 

Ef 2 f 2f 2u+3 f 2u+3 ^ 
Jai • • • J as J ci ■ • • Jcp ^ 

(ai,...,as,bi,...,bs,ci,...,ct-s)GA^+'' 

f 2u+l f 2!/+l f 2 f 2 
Jcp+i ■■■Jct-s Jbi ■■■Jbs 



\a£A / \ceA J \ci^A / 



t—s—p 



KbeA 



P2^'p2u+3''P2v+l ' ^, 



and 



Gov Y,) = E{YuY,) - E{Yu)E{Y,) = P2''P2u+3''P2u+ 



t—s—p 



P2 



2k 



(18) 



(19) 



It is straightforward to check that similar results apply to subregions II, III 
and IV, giving the contribution to the sum over covariances in Eq. [8] from Case 4 
as 



fc-l i-l 

^ Gov (y„, Y,)=AnmY,Y. {p2^'P2u+3'P2u+i'-'-'' - P2^' 

u v£R t=l s=0 

where i? = I U II U III U IV is the union of the four subregions of Case 4 and 

k — s 



(20) 



t - s 



p = {k — s) mod {t — s). 



(21) 



Case 5: V (z one of the subregions V or VI of in Fig.^ That is, v = {i+t,j+s) 
where 
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A: 







Cj, . . ., c^_^ 







B: 



a^, . .., 


Ci, Cf_g 


Ci, Cf_g 




bi, ...,b^ 



Figure 8: Word match configuration corresponding to Case 4(1). If tlie letters 
ai, . . . , tts, bi, . . . ,bs and ci, . . . , Ct-s are specified, the remaining letters within 
the four words must be repeats of ci, . . . , Ct-s as shown, the final repeat being 
truncated at the same point in both Sequence A and Sequence B. 
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A: 



a^, ... , 


bi, ...,b^ 


a^, ... , 


bi, ...,b^ 





B: 



bi, ...,b^ 


a^, ... , 


bi, ...,b^ 





Figure 9: Word match configuration corresponding to Case 5(V). If the letters 
Oi, . . . , Of, bi, . . . ,br are specified, the remaining letters within the four words 
must be repeats of ai, . . . , 6,. in Sequence A and fei, . . . , in sequence B, the 
final repeat being truncated at the {k + t)th or (A; + r)th position respectively. 
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V: 1 < t < A; - 1, -A; + 1 < s < -1; 

VI: 1 < s < /c - 1, -A: + 1 < t < -1. 

Consider subregion V first. The word matches corresponding to the event 
"Yu = 1,1^ = 1" are illustrated in Fig. [9l Set r = —s. For the event to occur, the 
affected block of length t + k in Sequence A must consist of repeats of a sequence 
{ai, . . . , at,bi, . . . ,br) where oi, . . . , and bi, . . . ,br are independently specified 
letters. The final repeat is truncated at the {k + t)th letter. The affected block in 
sequence B must consist of repeats of the sequence {bi, . . . ,br, ai, . . . , at), the final 
repeat being truncated at the (A: + r)th letter. 



Let L 



1, . . . ,t be the total number of times the letter occurs and m 



j = 1, . . . , r be the total number of times the letter bj occurs in the two blocks in 
Fig. [9l By noting that the stretches of length k not including the first t letters of the 
A-block or not including the first r letters of the B-block each contain [_k/{r + 1)\ 
complete repeats of all s+t independent letters plus a final k mod (r+t) remaining 
letters at the right hand end, we arrive at 



k 



rrii 



1 + 2r/ + 



1 + 2r/ + 



if i < C 
otherwise 

ifj<C 
otherwise 



+ 



+ 



if i < C — r 
otherwise 

ifj<C-t 
otherwise 



(22) 



where 



Then 



r + t 



C, = k mod (r + t) 



(23) 



E{YuY,) = Vt{Yu = 1,Y, = 1) 



{ai,...,at,bi,--;br)eA*^ 



\aeA / \aeA / \beA / 




\i=l / \j=l 



(24) 



and 



Gov (y„,y,) = EiYM - E{YMYv) = [Thi^j ( n 



(25) 
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A similar result holds for subregion VI. The contribution to the sum over 
covariances in Eq. [8] from Case 5 is then 

fc-i 

J] J^Cov(y„,y,) = 2nm J] 

u v£S r,t=l 

where S" = V U VI is the union of the two subregions of Case 5 and li and nij are 
given by Eq. [22l 

Finally, by Eq. [8l the variance of D2 is given by the sum of the right hand 
sides of Eqs. [TOl El El EQ] and [26l 

II. Consequences of failing to pre-specify parameters in 
the Kolmogorov-Smirnov test 

Given a random sample of observations Xi, X2, ■ ■ ■ , Xjy^^^^^^ , the Kolmogorov- 
Smirnov test [3] gives p-values for the null hypothesis that the observations are 
associated with pre-specified distribution function -Phyp- The two-sided version of 
the test considered here uses as a test statistic supj |Fhyp(Xj) — S{Xi)\, where 5 is 
the empirical cumulative distribution function based on the observations. Under 
the null hypothesis the p-values obtained are uniformly distributed on the interval 
[0,1]. 

Importantly, if the hypothesised distribution Fhyp is not fully pre-specified, but 
relies on estimates from the sample, the reported p-values will not be uniformly 
distributed under the null hypothesis. To illustrate this, we have generated a set of 
10,000 independent samples of A'^gampie = 2500 random numbers from a standard 
normal distribution, and applied the two-sided Kolmogorov-Smirnov test to each 
sample using the R function ks.test. Histograms of the p-values obtained are 
shown in Fig. [TUl In the first plot each sample was tested against the standard 
normal A^(0, 1), whereas in the second plot each sample was tested against a normal 
distribution whose mean and variance was estimated from the sample. We see that 
in this situation, where the null hypothesis is true, but the Kolmogorov-Smirnov 
test is applied incorrectly, p-values are skewed heavily towards 1. 

In a second test to see whether incorrect use of the Kolmogorov-Smirnov test 
can lead to an overly optimistic indication of agreement with a hypothesised dis- 
tribution, we generated a set of 10,000 independent samples of A^sampie = 2500 
random numbers from a Gamma distribution with mean 10 and variance 1. This 
distribution is close to, but not identical with, the normal distribution A^(10, 1). 
The third and fourth histograms in Fig. [10] are of p-values obtained from applica- 
tion of the Kolmogorov-Smirnov test against a pre-specified A^(10, 1), and against 
a normal distribution with mean and variance estimated from the sample respec- 
tively. The third plot is an indication the distribution of p-values that will result if 
the Kolmogorov-Smirnov test for normality is applied correctly to this non-normal 



P2 



2k 



(26) 
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KS P-value: N(0,1), parameters pre-specified 



KSP-value: N(0,1), 




KS P-value: Gamma(10,1] data, parameters pre-specified 




KS P-value: Gamma(10,1) data, parameters estimated 




Figure 10: Histograms of p-values obtained from the Kolmogorov-Smirnov 

test applied to artificially generated data tested against a normal distribution. 
From top to bottom the plots are (i) standard normal data, parameters of the 
hypothesised distribution pre-specified; (ii) standard normal data, parame- 
ters of the hypothesised distribution estimated from the data; (iii) gamma 
distributed data with mean 10 and variance 1, parameters of the hypothe- 
sised distribution pre-specified; and (iv) gamma distributed data with mean 
10 and variance 1, parameters of the hypothesised distribution estimated 
from the data. 
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data. Again we see that p-values are overestimated in the fourth plot when the 
test is used incorrectly. 



III. Limiting distribution of the maximum of N i.i.d. 
random variables 

We are interested in the limiting distribution for N large of the random variable 

-^max = maxXi, (27) 

where Xi,i = 1, . . . ,N are i.i.d. random variables with common density function 
fx and distribution function 

Fx{x) = r fxim. (28) 

The general theory of extreme value distributions is given in the book by Gum- 
bel [5], Chapter 5.2. For distributions of "type I", which includes the normal 
and gamma distributions, the distribution function of Xmax) namely {Fx{x))^ , 
asymptotes to the double exponential distribution function 

G{x) = exp(-e-^), (29) 

where the reduced largest value is defined as 

y = aN{x-UN)- (30) 

Here ujy, called the characteristic largest value, is determined by the condition 
that in N observations of X, the expected number of values greater than or equal 
to UN is unity. It is the solution to the equation 

Fx{un) = 1-^, (31) 

and for the case of the normal and gamma distribution is easily found using the 
R function qnormO and qgammaQ respectively. The parameter is called the 
extremal intensity function and is given by 

aN = NfxiuN). (32) 
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